Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
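To make the labeling-function idea concrete, here is a minimal sketch in plain Python. The function names, example heuristics, and the majority-vote combiner are our own illustrative assumptions, not Snorkel's API: Snorkel instead learns the unknown accuracies and correlations of the labeling functions with a generative model before producing probabilistic training labels.

```python
# Illustrative sketch only: each labeling function votes on an example or
# abstains; a naive majority vote stands in for Snorkel's learned
# denoising model (all names below are hypothetical).

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    # Heuristic: a causal keyword suggests a positive label.
    return POSITIVE if "cause" in text.lower() else ABSTAIN

def lf_negation(text):
    # Heuristic: an explicit negation phrase suggests a negative label.
    return NEGATIVE if "not associated" in text.lower() else ABSTAIN

def combine(lfs, text):
    """Combine noisy labeling-function votes by simple majority."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_negation]
print(combine(lfs, "Smoking causes cancer."))  # 1 (POSITIVE)
```

Because each heuristic may be wrong or may conflict with others, replacing the majority vote with a model that estimates per-function accuracy is exactly the denoising step the abstract describes.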